ABSTRACT

This paper treats the development and use of criteria for model
selection, particularly for the choice of the number of groups
("clusters") in the analysis of multivariate data and of the number of
classes of segments in the segmentation of time series and digital images.
Criteria such as those of Akaike, Schwarz and Kashyap are considered.
INTRODUCTION

This article treats the development and use of model-selection criteria,
particularly for the choice of the number of clusters in multivariate data
analysis and the number of classes of segments in the segmentation of time
series and digital images. Criteria such as those of Akaike (1973, 1974, 1981),
Schwarz (1978) and Kashyap (1982) are considered.

MODEL-SELECTION  CRITERIA 

Consider the problem of choosing from among a number of models, indexed
by k (k = 1, 2, ..., K). Let L(k) be the likelihood given the k-th model.

Various model-selection criteria taking the form

-2 ln[max L(k)] + a(n) m(k) + b(k),  (1)

have been developed in relatively recent years. Here n is the sample size,
ln denotes the natural logarithm, max L(k) denotes the maximum of the
likelihood over the parameters, and m(k) is the number of independent
parameters in the k-th model. For a given criterion, a(n) is the cost of
fitting an additional parameter and b(k) is an additional term depending
upon the criterion and the model k.

Akaike, in a very important sequence of papers, including Akaike (1973,
1974, 1981), developed such a criterion as an (heuristic) estimate of the
expected entropy (Kullback-Leibler information). Akaike's information
criterion (AIC) is of the form (1) with

a(n)  =  2  for  all  n,  b(k)  =  0  (AIC).  (2) 

Schwarz  (1978),  working  from  a  Bayesian  viewpoint,  obtained  a  criterion  of 
the  form  (1)  with 

a(n) = ln n,  b(k) = 0  (Schwarz's criterion).  (3)

Since, for n of 8 or more, ln n exceeds 2, Schwarz's criterion favors models
with fewer parameters than does Akaike's. Rissanen (1978a) obtained a
criterion of the form (1) as a solution to a problem of minimum-bit
representation of a signal. His criterion, for this reason referred to as
SDD (shortest data description), is given by

a(n) = ln[(n+2)/24],  b(k) = 2 ln(k+1)  (Rissanen's criterion).  (4)

Boekee and Buss (1981) studied the performance of several criteria, namely
Rissanen's and the criteria given by

a(n) = ln[(n+2)/24],  b(k) = 0  (5)

and 

a(n) = ln(n+2),  b(k) = 0.  (6)

Note that (6) is essentially Schwarz's criterion. They simulated a
second-order autoregression with autoregression coefficients -0.8 and -0.9
for n = 50, 100, 200 and 400 (fifty times for each case) and found that (6),
the criterion which is essentially Schwarz's, gave good results, better than
AIC. (It should be mentioned that the b(k) in (4) is specific to the problem
of fitting a k-th order autoregression.) The criteria (4) and (5) gave
mediocre results, similar to AIC. This assessment by Boekee and Buss was
based on the distribution of the order estimate in the simulation
experiments (the true value being 2, for second order), for the various
criteria.
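
The penalty terms of criteria (2) through (6) can be compared numerically.
The following is a minimal Python sketch, not taken from any of the cited
papers. The dictionary CRITERIA and the function names are illustrative
choices, and (4) is read here as a(n) = ln[(n+2)/24], b(k) = 2 ln(k+1),
which is an assumption about the intended formula.

```python
import math

# Penalty terms (a(n), b(k)) for criteria (2)-(6).
# Labels are illustrative. The reading of (4) is an assumption:
# a(n) = ln[(n+2)/24], b(k) = 2 ln(k+1).
CRITERIA = {
    "AIC (2)": (lambda n: 2.0, lambda k: 0.0),
    "Schwarz (3)": (lambda n: math.log(n), lambda k: 0.0),
    "Rissanen (4)": (lambda n: math.log((n + 2) / 24.0),
                     lambda k: 2.0 * math.log(k + 1)),
    "(5)": (lambda n: math.log((n + 2) / 24.0), lambda k: 0.0),
    "(6)": (lambda n: math.log(n + 2), lambda k: 0.0),
}


def criterion(max_log_lik, n, m, k, name):
    """Value of form (1): -2 ln[max L(k)] + a(n) m(k) + b(k)."""
    a, b = CRITERIA[name]
    return -2.0 * max_log_lik + a(n) * m + b(k)


def penalty_table(sample_sizes=(50, 100, 200, 400)):
    """Cost a(n) of one additional parameter under each criterion."""
    return {name: [round(a(n), 3) for n in sample_sizes]
            for name, (a, _) in CRITERIA.items()}


if __name__ == "__main__":
    for name, row in penalty_table().items():
        print(f"{name:13s}", row)
```

Over the sample sizes used in the simulation, ln[(n+2)/24] runs from roughly
0.8 to 2.8, which brackets AIC's constant 2, while ln(n+2) runs from roughly
4.0 to 6.0.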

Note that, of the criteria, only AIC has a(n) a constant function of n.
Various researchers, including Kashyap (1982), Rissanen (1978a,b) and Schwarz
(1978), have mentioned that AIC is not consistent; a(n) needs to depend upon n.
Thus the particular form of (1) chosen by Akaike through his heuristic
estimation argument may not be best.

Kashyap (1982), also taking the Bayesian approach, took the asymptotic
expansion of the logarithm of the posterior probabilities a term further than
did Schwarz and obtained the criterion given by

a(n) = ln n,  b(k) = ln[det B(k)]  (Kashyap's criterion),  (7)

where det denotes determinant and B(k) is the negative of the matrix of
second partials of ln L(k), evaluated at the maximum likelihood estimates.
In Gaussian linear models this is the inverse of the covariance matrix of the
maximum likelihood estimates of the regression coefficients; in general, the
expectation of B(k), evaluated at the true parameter values, is Fisher's
information matrix. Since Kashyap's criterion is based on reasoning similar
to Schwarz's, but contains an extra term, it could be expected to perform
better.
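
To make the extra term in (7) concrete, here is a minimal Python sketch that
evaluates Kashyap's criterion from a maximized log-likelihood and a matrix
B(k) supplied by the caller. The function names are hypothetical, m(k) is
assumed to equal the dimension of B(k), and ln det B(k) is computed with a
plain Cholesky factorization, which assumes B(k) is symmetric positive
definite.

```python
import math


def log_det_spd(B):
    """ln det B for a symmetric positive-definite matrix (list of rows)."""
    size = len(B)
    L = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(i + 1):
            s = B[i][j] - sum(L[i][t] * L[j][t] for t in range(j))
            if i == j:
                if s <= 0.0:
                    raise ValueError("matrix is not positive definite")
                L[i][j] = math.sqrt(s)
            else:
                L[i][j] = s / L[j][j]
    # det B is the squared product of the Cholesky diagonal.
    return 2.0 * sum(math.log(L[i][i]) for i in range(size))


def schwarz(max_log_lik, n, m):
    """Form (1) with a(n) = ln n and b(k) = 0."""
    return -2.0 * max_log_lik + math.log(n) * m


def kashyap(max_log_lik, n, B):
    """Form (1) with a(n) = ln n and b(k) = ln det B(k).

    Assumption: m(k) is taken to be the dimension of B(k).
    """
    return schwarz(max_log_lik, n, len(B)) + log_det_spd(B)
```

Written this way, the two criteria differ only by ln det B(k), so they
coincide whenever det B(k) = 1.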

In what follows, the application of Akaike's, Schwarz's and Kashyap's criteria
to various specific problems will be discussed. In these applications the
criteria often agree (give the same choice of model), but in cases when they
disagree, AIC chooses the least parsimonious model, Kashyap's criterion the
most parsimonious, Schwarz's falling in between. Some of the examples studied
are of known structure (the correct model is known), and in cases of
disagreement Kashyap's criterion did best and Schwarz's second best. Thus,
the particular specifications put on the form (1) by Akaike may not be the
best. Nonetheless, the profession is greatly in his debt for repeatedly
calling our attention to the very important model-selection problem.

APPLICATION  TO  CLUSTERING  AND  SEGMENTATION 
Multi-sample  clustering 

The problem of multi-sample clustering, the grouping of samples, is
treated in Bozdogan and Sclove (1982), where numerical examples are given.
The situation is the K-sample problem (one-way analysis of variance), with an
emphasis on grouping the samples into fewer than K clusters. The use of
model-selection criteria in this situation can provide an alternative to
multiple-comparison procedures. Use of model-selection criteria avoids the
difficult choice of levels of significance in such problems. Here in the
Gaussian case with p variables one has a mean vector for each population.
With separate covariance matrices, m(k) = k[p + p(p+1)/2]. With a common
covariance matrix, m(k) = kp + p(p+1)/2. Model-selection criteria can also be
used to decide whether or not to assume a common covariance matrix.
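
The two parameter counts can be written out as a short Python sketch. The
function names are illustrative only. Each covariance matrix is symmetric and
so contributes p(p+1)/2 free parameters.

```python
def m_separate(k, p):
    """m(k) = k[p + p(p+1)/2]: k mean vectors and k covariance matrices."""
    return k * (p + p * (p + 1) // 2)


def m_common(k, p):
    """m(k) = kp + p(p+1)/2: k mean vectors and one shared covariance matrix."""
    return k * p + p * (p + 1) // 2


def parameters_saved(k, p):
    """Reduction in m(k) from assuming a common covariance matrix."""
    return m_separate(k, p) - m_common(k, p)  # equals (k-1) p(p+1)/2
```

For example, with k = 3 clusters and p = 2 variables the counts are 15 and 9,
so the common-covariance model carries a penalty smaller by 6 a(n) in (1).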

Mixture-model  clustering  of  individuals 

Bozdogan (1983) applies model-selection criteria to the choice of the
number of populations in the population mixture model. (See, e.g., Wolfe
1970.) Here there are k-1 independent mixture probabilities. In the Gaussian
case with p variables and different covariance matrices, m(k) = k-1 +
k[p + p(p+1)/2]. The algorithm and computer programs of Wolfe (1970) can be
used to obtain the maximum likelihood estimates for fixed k. Then
model-selection criteria can be used to estimate k.
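
The two-stage procedure (fit for each fixed k, then compare criterion values)
can be sketched in Python as follows. This is an illustration under stated
assumptions, not the cited programs: the maximized log-likelihoods are
assumed to be computed elsewhere and passed in, b(k) is taken to be 0, and
the names m_mixture and estimate_k are hypothetical.

```python
import math


def m_mixture(k, p):
    """m(k) = (k-1) + k[p + p(p+1)/2] for a k-component Gaussian mixture."""
    return (k - 1) + k * (p + p * (p + 1) // 2)


def estimate_k(max_log_lik_by_k, n, p, a=math.log):
    """Return the k minimizing -2 ln[max L(k)] + a(n) m(k), with b(k) = 0.

    max_log_lik_by_k maps each candidate k to its maximized log-likelihood,
    obtained elsewhere (for example from a mixture-fitting program).
    a is the penalty function a(n): math.log gives Schwarz's criterion (3),
    and `lambda n: 2.0` gives AIC (2).
    """
    def value(k):
        return -2.0 * max_log_lik_by_k[k] + a(n) * m_mixture(k, p)

    # Sorting the candidates first means ties go to the smallest k.
    return min(sorted(max_log_lik_by_k), key=value)
```

Passing the penalty as a function keeps the fitting step separate from the
choice of criterion, so the same fits can be compared under AIC and under
Schwarz's criterion.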

Segmentation  of  Time  Series 

A model for clustering or segmentation is given by assuming that each
instance of observation gives rise not only to an observation x but also to a
label y, equal to 1, 2, ..., or k, where k is the number of classes.
Model-selection criteria are used to estimate k. In the context of this model,
clustering is merely estimation of the labels. Sclove (1982a,c) treats the
problem of segmentation of time series by modeling the label process as a
Markov chain. An algorithm and computer programs are discussed; numerical
examples are given. The parameters are the transition probabilities, the
marginal probabilities of the classes, and the parameters of the
class-conditional densities, so m(k) can be taken to be k(k-1) + (k-1) + c(k),
where c(k) = k[p + p(p+1)/2] in the Gaussian case with separate covariance
matrices.

Segmentation  of  Digital  Images 

Similar ideas are applied to digital images in Sclove (1982a,b). Here
the label process is modeled as a one-sided Markov random field. In the
first-order case the label of each pixel is conditioned on the labels
immediately to the north and west of it. The number of independent transition
probabilities is k^2(k-1). Further details and examples are given in Sclove
(1982a).
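
The parameter counts for the two label-process models can be sketched in
Python as follows. The function names are illustrative. The image count is
read here as k^2(k-1), on the assumption that each of the k^2 possible
(north, west) label pairs has its own distribution over k labels with k-1
free probabilities.

```python
def c_gaussian(k, p):
    """c(k) = k[p + p(p+1)/2]: class-conditional Gaussian parameters."""
    return k * (p + p * (p + 1) // 2)


def m_time_series(k, p):
    """m(k) = k(k-1) + (k-1) + c(k) for the Markov-chain label model.

    k(k-1) transition probabilities, k-1 marginal class probabilities,
    and c(k) parameters of the class-conditional densities.
    """
    return k * (k - 1) + (k - 1) + c_gaussian(k, p)


def image_transition_count(k):
    """k^2 (k-1) transition probabilities for the first-order image model.

    Assumption: one row of k-1 free probabilities per (north, west) pair.
    """
    return k * k * (k - 1)
```

The transition count grows as k^2 for the time-series model and as k^3 for
the image model, so the penalty term in (1) rises much faster with k for
images.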
SUMMARY

THE USE OF MODEL-SELECTION CRITERIA IN CLUSTERING AND THE SEGMENTATION
OF TIME SERIES AND DIGITAL IMAGES

This article treats the development and use of model-selection criteria,
especially for the choice of the number of groups (that is, "clusters") in
the analysis of multivariate data and of the number of classes of segments
in the segmentation of time series and digital images. Criteria such as
those of Akaike, Schwarz and Kashyap are considered.